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Abstract 

The problem of finding periodically expressed genes from time course microarray 
experiments is at the center of numerous efforts to identify the molecular compo- 
nents of biological clocks. We present a new approach to this problem based on 
the cyclohedron test, which is a rank test inspired by recent advances in algebraic 
combinatorics. The test has the advantage of being robust to measurement errors, 
and can be used to ascertain the significance of top-ranked genes. We apply the 
test to recently published measurements of gene expression during mouse somitoge- 
nesis and find 32 genes that collectively are significant. Among these are previously 
identified periodic genes involved in the Notch/FGF and Wnt signaling pathways, 
as well as novel candidate genes that may play a role in regulating the segmentation 
clock. These results confirm that there are an abundance of exceptionally periodic 
genes expressed during somitogenesis. The emphasis of this paper is on the statis- 
tics and combinatorics that underlie the cyclohedron test and its implementation 
within a multiple testing framework. 
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1 Introduction 



The search for the molecular components of biological clocks is an important first step 
towards understanding the regulatory mechanisms underlying periodic behavior at the 
molecular level. Examples of clocks that have been studied include the circadian clock 
McDonald and Rosbash (2001)|, the respiratory cycle clock in yeast Klevecz eta/. (2004) 



Spellman et.al. (1998)] and the segmentation clock in vertebrates Pourquie (2003)] . In 



order to find clock-related genes in a high-throughput fashion, time course array exper 
iments are performed to measure the expression levels of genes on a genome-wide scale. 
This is followed by a statistical analysis to find periodically expressed genes. The anal- 
ysis is non-trivial for reasons that include noisy measurements, variable times between 
experiments, vague notions of periodicity, and loss of power due to multiple testing. 

The question of how best to analyze cyclic time series is a topic of extensive re- 
search in statistics Chatfield (1978)| . Recent approaches, proposed in the context of 



microarray analysis include splines and other curve approximations Luan and Li (2004) 
Storey et.al. (200"5)] , methods based on signal processing techniques such as the Lomb- 



Scargletest [Glynn et.al. ("2006) | , and non-parametric rank tests Willbrand et.al. (2005)] . 
All of these methods address, to varying degrees, the difficulties outlined above, and are 
sometimes developed in response to specific needs dictated by individual experiments. 

In this paper we introduce a new test for finding periodic genes. Our method belongs 
to the family of convex rank tests in Morton et.al. (2007), Section 5]. These tests were 
inspired by up-down analysis, the method of Willbrand et.al. (2005)] . They are based 
on recent advances in algebraic combinatorics, namely the theory of graph associahedra 

Markl ( 1999)[. 



Fomin and Reading (2004) , Hohlweg and Lange (2005) 



The connection 

between rank tests and polytopes was first suggested in Cook and Seiford (1983)[ . When 
using rank tests, an expression time-course is represented by a permutation. This has the 
advantage of providing robustness to noise, monotonic transformations, and uncertainty 
with respect to the underlying probability distributions, and the disadvantage of preclud- 
ing a parametric analysis of the untransformed time courses. In up-down analysis, each 
permutation of {1,2, . . . ,n} is mapped to a sign vector, or signature, that records, for 
each adjacent pair on the n-path, which of the two measurements is higher. Significance 
is determined by counting the number of permutations that have an observed signature. 

Our cyclohedron test is based on a similar permutation count to that of up-down anal- 
ysis, but the data points are now compared at longer range along the edges of the n-cycle. 
The cyclohedron C„ is the graph associahedron when the graph G is the n-cycle, and the 
cyclohedron test is the greedy method for linear programming on C„. It is equivalent 
to the test denoted by T^j-fj) 



m 



Morton et.al. (2007), Section 5]. Cyclohedra are also 



known as Bott-Taubes polytopes, and they play an important role in representation the- 



ory |Fomin and Reading (2004) , Section 3.2], combinatorics Hohlweg and Lange (2005) 
Sandman (2004)] , and homotopy theory [Markl (1999)]. Connect ions to statistical learn 



ing theory were explored and developed in [Morton et.al. (2006) Morton et.al. (2007)[ . 

The cyclohedron test is explained in detail in Section [2l Our presentation is elementary 
and self-contained. In Section 3 we present a method for assigning p-values to top-ranked 
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groups of genes. This is done within a multiple hypothesis testing framework, which 
is compatible with any rank test for permutation data, including up-down analysis. In 
Sections 5 and 6 we develop the combinatorial details and efficient algorithms for the 
cyclohedron test. Our R code is available online, and its use is described in the Appendix. 

We apply the cyclohedron test to data reported in Dequeant et.al. (2006)| , consisting 
of 17 distinct expression array experiments from the presomitic mesoderm tissue of mouse 
embryos. These data were chosen because of the analyses already undertaken and the 
possibility for biological validation. Results are discussed in Section 4. We find that 
although the high-throughput array experiments are effective for finding groups of genes 
likely to be involved with clock regulation, multiple testing issues preclude the assignment 
of significance to any individual gene on the basis of periodic-looking patterns alone. 



2 The cyclohedron test 

The cyclohedron test is appropriate when seeking to determine whether a time course 
expression is periodic. Within a single hypothesis setting, the null hypothesis states that 
a gene or other unit of interest does not exhibit cyclic expression. The cyclohedron test 
provides a test statistic, which we call the permutation count, that replaces this vague 
null hypothesis. The test applies to data vectors v = {vi, . . . ,Vn) whose coordinates are 
distinct real numbers. The coordinates Vi are measurements of the same quantity at 
distinct points. In our applications, the ordering of each vector should be with respect to 
some 'cyclic' time, so that any v' = (f j, fj+i, . . . , f„, f i, . . . , is an equally meaningful 
ordering. For example, the data vectors v we analyze in Section H] are ordered within a 
somite-formation cycle; so vj is a measurement taken before f j+i in the cycle, where j + 1 
is understood mod n. 

The following procedure computes, for any given data vector v, its signature a{v) and 
its permutation count c(t>). The signature is an unordered set a = {o"i, o"2, . . . , cr„-i} of 
subsets of {1,2, ... ,n} and the permutation count is a positive integer. 



Algorithm 2.1. (Cyclohedron test) 

Input: A vector v = [vi, . . . , f„) of distinct real numbers. 

Output: The signature cr = {cti, (72, ... , and the permutation count c for v. 

Initialize c := 1. 

For i from 1 to n—1, do 

Initialize cxj = 0, the empty set. 

Let 6i be the unique index such that w^. is the i-th largest coordinate of v. 
Initialize Left := and Right := 0. 
For k from 1 to i—1, do 

if (Jfc contains 5j — 1 (modulo n) then set Left := cr^, 

if (Jfc contains 5j+l (modulo n) then set Right := cr^. 
Set a, := {6,} U Right U Left and c := c ■ (l^^^l^l+Mtl) . 
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Let Cn denote the set of all signatures a{v) as v runs over M". Algorithm 12.11 constructs 
not only the signature a and the permutation count c but also the descent order permuta- 
tion S = {Si, 62, ■ . . , Sn) of the data vector v = (f 1, t>2, . . . , f„). Since a{v) depends only on 
the descent order permutation 6, our algorithm specifies a map 6 ^ a from the symmet- 
ric group S„ onto the set C„. For n > 4, this map is not injective, and we are interested 
in the cardinalities of the preimages. For instance, the permutations 6 = (1,3,2,4) and 
6' = (3,1,2,4) have the same signature a{5) = a{5') = {{1}, {3}, {1, 2, 3}}. 

The test statistic, the permutation count c, is the number of permutations having the 
same signature as the permutation of interest. Significant data vectors have small test 
statistics, because it is unlikely that a random permutation will have a topographical 
map shared by few permutations. The permutation count c = c{v) has the following 
interpretation. Suppose that an appropriate null data generating distribution for each 
data vector v induces the uniform distribution on all descent order permutations 6 in the 
symmetric group E„. Note that this assumption is valid if the coordinates of the data 
vector are independent and identically distributed under the null distribution, so our test 
is therefore broadly applicable. For each signature a G C„, let p{a) denote the probability 
that the signature a would be observed under such a null distribution. The following 
proposition states that p{a) is the fraction of permutations 6 that map to a. 

Proposition 2.2. The permutation count c computed by Algorithm \2.1\ depends only on 
the signature o. It equals the number of permutations 6 that are mapped to a, and hence 

c = c(cr) = p{a) ■ n\. 

Proof. For each cij in the signature a, at most two other sets aj and are contained 
in (Tj and are maximal with this property. Here aj and ak are necessarily disjoint. The 
permutation count c is the product of the corresponding binomial coefficients (''^j^l''')- 
It depends only on a. The second statement is proved by induction on n, using the fact 
that any valid permutation of aj can be shuffled with any valid permutation of a^, and 
augmented by 6i, to get a valid permutation for (jj. Carrying out this process until i = n, 
with (T„ = {1, 2, . . . , n}, yields precisely all permutations 6 that have signature a. □ 

The other output of the algorithm, the signature cr{v), can be viewed as a topographic 
map on the n-cycle that captures the shape of the data v. Algorithm 12.11 is an iterative 
procedure for drawing this topographic map. Namely, we encircle the vertices of the n- 
cycle in decreasing order of their corresponding data vector coordinates, that is, in the 
order 61,62, ... ,6n~i- (The first circle is the set cxi, the second is (J2, and so on.) We 
do this according to the following provision: in order to encircle 6i, if it is adjacent to 
some vertex j which has already been encircled by some ak, then cxj must contain the cr^ 
circle. Accordingly, the sets "Left" and "Right" keep track of how far to the left and right 
(Ti must extend. The result is an unordered set a of n—1 encircled sets cxi, o"2, . . . , cr„_i. 
Figure 1 displays the beginning of an example of this encircling process for n = 11. 

We say that the height hi of the i-th vertex in the topographic map for v is the 
number of sets aj which contain i. We can identify the signature a with the height vector 
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Figure 1: Algorithm 12.11 constructs a topographic map on the n-cycle by subsequently 
encircling vertices in order of decreasing size of the corresponding component of a data 
vector. Displayed at the top are the formations of the first two components cTj, and at 
the bottom are the third and fourth, of the signature for an example with n = 11. 

h = {hi, h2, . . . , hn), because a can be recovered uniquely from the vector h. The map 
V (— > h{v) can be viewed as a smoothing of the data; see Figure [2l 

Remark 2.3. The cyclohedron test applies when there are no ties Vi = Vj in the data. 
When ties occur, we examine all possible permutations 6 arising from small perturbations. 

Example 2.4. In our analysis in Section 4, the number of microarray experiments is 
n = 17, and the number of probesets (labels of the data vectors) is = 13, 873. The 
probeset ranked first in Table 1 represents a gene named Obox. Its data vector equals 

V = (0.738,0.996,0.705,0.150,-0.566,-0.673,0.774,-0.736,-0.788, 
-0.802, -1.276, -0.521, 0.238, -0.258, -0.249, -0.084, -0.117). 

The descent order permutation for this vector v equals 

S = (2,7,1,3,13,4,16,17,15,14,12,5,6,8,9,10,11). 

The signature a is given by the unordered set ai = {2}, a2 = {7}, (Xa = {1,2}, 
(T4 = {1,2,3}, etc. The permutation count c = 480 is the product of the three con- 
tributions made by 5, 8 and 12, respectively, when constructing erg = {1,2,3,4,16,17}, 
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(Tio = {1, 2, 3, 4, 13, 14, 15, 16, 17}, and (T13 = {1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15, 16, 17}. When 
viewing a as a topographic map for the data f , we obtain the height vector 

h{v) = (12,13,11,10,5,4,5,3,2,1,0,6,8,7,8,10,9). 

Figure [2] displays the data v and the height vector h{v) plotted around the circle. □ 




Figure 2: The data v (left) and the height vector h{v) (right) for the gene Obox. 



3 Significance testing 

Multiple hypothesis testing is of concern in microarray experiments, because the number 
of hypotheses that are tested simultaneously is large. In our application, there are N = 
13, 873 null hypotheses. The hypotheses take the form "the r genes with the smallest 
counts c arose by chance", for r = 1, 2, . . . , A^. In this section, we explain how to assign 
p- values to these groups, leading to a criterion for determining which hypotheses to reject. 

Applying the cyclohedron test to data vectors v^^\ . . . , f '•^^ in M" means computing 
their permutation counts c(f ^^)), . . . , c{v^^'^) . The highest ranked data are those for which 
c(f'^*^) is smallest. Under the null hypothesis, the probability distribution on M" of each 
data vector f induces the uniform distribution U on the n! permutations 5. Viewed as 
a random variable, the permutation count c has probability distribution function 

Pe : im(c) ^ [0, 1] , 7 ^ Pr(c(5) = 7). (1) 

Here im(c) = {71 < 72 < • ■ ■ < 7s„} is the set of all positive integers that arise as 
permutation counts c{a) for some a G C„. The probability distribution function Pc is 
displayed in Figure [3] for n = 17, which will be the number of time points in Section 4. 
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Figure 3: The probability distribution function of c for n = 17. 

We now fix two integers 1 < r < A^. The order statistic C(r) is the function M^^" 
im(c) which takes any hst of data vectors V = {v^^\ . . . ,v^'^^) and returns the r*'^ 
smallest value among the permutation counts c{v^^^), . . . , c(v^^^). Recall that under the 
null hypothesis, each f has a distribution on M" which induces the uniform distribution 
on permutations 5. Further let us assume that the data vectors w'-*-' are independent. This 
induces a joint distribution Qq on the vector ( c{v^^^), . . . , c(v^^^) ) G M^^" of counts. In 
this framework, we view the order statistic C(r) as a random variable with distribution 

F(,):im(c)^[0,l],7 ^ Pr(CM = 7). 

Qo 

In other words, F(^r) (7) is the probability that the r-th smallest value among the permuta- 
tion counts c(v^^^), . . . , c(f '^^)) of random data vectors equals 7. The function -F'(r)(7) 
depends only on n, N and r. Its efficient computation is explained In Section O 

Definition 3.1. (p- value) Suppose we apply the cyclohedron test to data vectors in 
M", and the data vector whose permutation count is the r-th smallest has permutation 
count 7fc. Then the collective p-value of the group of r highest ranked data vectors is 

Pr(C(,) < 7fc) = i^M(7i) + i^(r)(72) + ■ ■ ■ + i^M(7fc)- (2) 

Qq 

The p-value ([2]) is the probability that the r-th order statistic for random data under the 
null would be less or equal to the value of the r-th order statistic for the observed data. 
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We now offer some remarks regarding how our multiple testing procedure differs from 
those typically used when the number of hypotheses is large (say, in the thousands). To 
analyze gene expression data, it is often appropriate to employ a joint null distribution Q 
that allows for dependencies among the genes. These dependencies are unknown, so an es- 
timate of the null distribution of the test statistics (for the cyclohedron test, the vector of 
counts) is made from bootstrap samples of the data; two such estimates are the null shift 
and scale [Birkner et.al. (2005)| and null quantile van der Laan and Hubbard (2006)| dis- 
tributions. Further, there is typically one null hypothesis per gene, and p-values are 
assigned to control some general Type-I error rate. However, our chosen joint null distri- 
bution Qq is simple and can be computed exactly. In addition, this test was motivated by 
the exploratory data analysis described in the next section. Having a powerful procedure 
was not critical; rather, the aim was to identify groups of top genes for further biological 
testing. In other settings, however, different choices of joint null distribution or of multiple 
testing procedure (such as those in Birkner et.al. (2005)| ) can improve power. 



4 Application to mouse microarray data 

We applied the cyclohedron test to microarray data from recent work that investigated 
the mouse segmentation clock Dequeant et.al. (2006)| . Dequeant et al. took 17 expres- 



sion measurements from mouse presomitic mesoderm on Affymetrix MOE430A arrays. 
By independently measuring the expression of the gene Lunatic Fringe (Lfng) which is 
known to be periodic within the somitogenesis cycle of embryonic development, Dequeant 
et al. ordered the 17 experiments within the cycle. Each array consisted of over 22, 000 
probesets, however we restrict the analysis to a subset of 13, 873 probesets by removing 
genes whose expressions are deemed "absent" across the experiments by Affymetrix stan- 
dards. In other words, the data consisted of 13,873 data vectors f , each of which was the 
expression level of one gene (divided by the mean across experiments and transformed to 
log2). We then applied the cyclohedron test to these data. We were interested in those 
genes whose counts c(w) were small. Accordingly, we ranked the genes by their counts; 
Table 1 presents the first 32 genes. Table 2 lists the significance of top groups of genes. 
For example, the first 32 genes collectively have a p-value of 0.081, which suggests that 
these 32 genes are of interest. At this point we recall the definition of a p-value which was 
given in equation ([2]). The p-value of the rank-1 gene Oboxl is the probability under the 
null hypothesis that the top-ranked permutation count is less than or equal to 480, while 
the p-value of the first 16 genes (the number 0.008 in Table 2) is the probability that the 
gene ranked 16 has permutation count less than or equal to 4928. It is important to em- 
phasize that the p-values do not reveal the significance of any individual gene, but rather 
of a collection of genes. For example, the top 19 genes having a collective p-value of 0.046 
means this: the probability that the first 19 genes would collectively all have permutation 
count at most 6825 under the null distribution is 0.046. In other words, the group as a 
whole is significant. However, we determine whether any individual gene in that group 
is significant. For example, there is no significance to the fact that Oboxl is ranked first. 
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Rank 


ProbeSet 


Gene Name 


Gene Description 


Count 


1 


1456017JC 


Oboxl 


similar to oocyte specific gene 


480 


2 


1452041 


Klhl26 


kelch-like 26 (Drosophila) 


1440 


3 


1418593 


Taf6 


TAF6 RNA polymerase II 


1560 


4 


1417985 


Nrarp 


Notch-regulated ankyrin repeat protein 


1950 


5 


1436845 


Axin2 


axin2 


2240 


5 


1436343 


Chd4 


chromodomain helicase DNA binding protein 


2240 


7 


1426267 


Zbtb8os 


zinc finger and BTB domain 


2310 


8 


1420360 


Dkkl 


dickkopf homolog 1 (Xenopus laevis) 


2520 


9 


1449643^s 


Btf3 


basic transcription factor 3 


2772 


10 


1417399 


Gas6 


growth arrest specific 6 


2800 


11 


1418102 


Hesl 


hairy and enhancer of split 1 (Drosophila) 


3120 


12 


1448799^ 


Mrpsl2 


mitochondrial ribosomal protein S12 


3150 


13 


1418729 


Star 


steroidogenic acute regulatory protein 


3600 


14 


1425424 


MGC7817 


hypothetical protein LOC620031 


3850 


15 


1455740 


Hnrpal 


heterogeneous nuclear ribonucleoprotein 


4004 


16 


1450204_a 


Mynn 


myoneurin 


4928 


17 


1449120_a 


Pcml 


pericentriolar material 1 


6006 


18 


1423106 


Ube2b 


ubiquitin-conjugating enzyme E2B 


6720 


19 


1420386 


Sehll 


SEHl-like (S. cerevisiae) 


6825 


20 


1456380JC 


Cnn3 


calponin 3, acidic 


8008 


21 


1419438 


Sim2 


single-minded homolog 2 (Drosophila) 


8640 


22 


1426524 


Gnpda2 


glucosamine-6-phosphate deaminase 2 


9009 


23 


1438557^ 


Dnpep 


aspartyl aminopeptidase 


9450 


24 


1454904 


Mtml 


X-linkcd myotubular myopathy gene 1 


10500 


25 


1448951 


Tnfrsflb 


tumor necrosis factor receptor superfamily 


10530 


25 


1433952 


Tufm 


Til translation elongation factor 


10530 


27 


1422327^ 


G6pd2/ 
G6pdx 


glucose-6-phosphate dehydrogenase 2 


10725 


28 


1416295_a 


I12rg 


inter leukin 2 receptor, gamma chain 


10920 


29 


1417316 


Them2 


thioesterase superfamily member 2 


11025 


30 


1450242 


Tlr5 


toll-like receptor 5 


11232 


31 


1449164 


Cd68 


CD 68 antigen 


11340 


32 


1418337 


Rpia 


ribose 5-phosphate isomerase A 


11760 



Table 1: The 32 genes ranked highest by the cyclohedron test. Gene descriptions are 
derived from those provided by Affymetrix. The suffix "_at" was removed from each 
ProbeSet ID. 
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1 1 
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1 9 


i. .o 


1 A 


1 

i. .0 


1 P. 

i..O 


1 7 
i.. ( 


1 8 
i..o 


U Vcxl Lie 


n 97Q 




n 9zLzL 


n 904 




n n(S4 






Group 


^ Q 
i..y 


1 in 

i. . iU 


1 11 
i. . i i 


1 10 


1 1 


1 1/1 


1 1 

i.. io 


1 1 

i.. io 


p- value 


0.014 


0.005 


0.005 


0.002 


0.003 


0.003 


0.002 


0.008 


Group 


1..17 


1..18 


1..19 


1..20 


1..21 


1..22 


1..23 


1..24 


p- value 


0.047 


0.069 


0.046 


0.139 


0.165 


0.173 


0.195 


0.312 


Group 


1..25 


1..26 


1..27 


1..28 


1..29 


1..30 


1..31 


1..32 


p- value 


0.192 


0.192 


0.168 


0.159 


0.118 


0.096 


0.075 


0.081 



Table 2: Significance of top-ranked groups of genes. For example, the first 32 genes have 
a collective p-value of 0.081. 

While it appears to be the most periodic pattern in the data by our analysis, that could 
have happened by chance (p-value 0.279). A natural cutoff value is to look at the first 32 
genes because collectively they have a p-value of 0.081 (the next ten p- values are between 
0.13 and 0.30). Note that analyses of microarray data have this property, that Type-1 
errors are all but guaranteed due to the large number of genes (and thus the large number 
of hypotheses) that are tested. Our computations were performed with the statistical 
software R |R Team (2005)] , using the implementation described in the Appendix. 

Dequeant et al. performed significance testing according to a Lomb-Scargle analysis, 
and then based on gene expression profile clustering, they identified genes belonging to 
three pathways Notch/FGF and Wnt that are involved with somitogenesis. There are 
genes that are deemed interesting by both the analysis of Dequeant et al. and the cyclo- 
hedron test. For example, Axin2 is ranked highly by the Lomb-Scargle (rank 6) and the 
cyclohedron test (rank 5). In addition, nrarp (rank 4 according to the cyclohedron test) is 
ranked poorly by Lomb-Scargle (rank 482), although it belongs to the Notch pathway and 
its gene expression clusters accordingly. Finally, there are novel genes such as Obox (rank 
1 by the cyclohedron test, but not known to be related to somitogenesis) that require 
further investigation. This suggests that to find periodic gene expression, it is beneficial 
to apply many methods, including Lomb-Scargle, clustering, and the cyclohedron test. 
Doing so enables us to find genes overlooked by each method, as well as to confirm find- 
ings of other tests. In other words, the findings of each method complement those of 
others by identifying candidate genes for knockout experiments. The forthcoming paper 
Dequeant et.al. (2007)| will compare various methods, including Lomb-Scargle, up-down 



analysis, and the cyclohedron test, for identifying cyclic genes from this data set. 

In conclusion, we remark that, although microarray expression analyses are frequently 
criticized due to the noise in individual measurements, the massively parallel nature of 
the experiments provide the possibility for finding groups of significant genes. Indeed, we 
confirm this in our analysis of the Dequeant et al. experiments Dequeant et.al. (2006)| , in 



which we are unable to confirm whether any individual gene is statistically significant, yet 
we can identify a group of genes that collectively are significant. The biological significance 
of individual genes can be determined by further targeted experimental validation. 
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Figure 4: The cyclohedron C4; its vertices correspond to the distinct signatures for n = 4. 
Following the notation of Morton et.al. (2006)| , the string 324 labels the cyclohedron 



vertex of data vectors whose descent permutation is 1324 or 3124. 

5 Combinatorics of the cyclohedron test 

We now describe the combinatorics and geometry behind our test. First, the set C„ of 
cyclic signatures is in natural bijection with the vertices of a certain convex polytope. The 
ra-cycle has n{n—l) connected induced proper subgraphs, namely, the cyclic segments of 
the form S = {i,i + 1, . . . ,i + k}. Here k < n—1, and the indices are understood modulo 
n. The cyclohedron vertex of a data vector v G M" is the vector t{v) G N" whose i-th 
coordinate T{v)i is the number of cyclic segments S containing i such that Vi = minify : 
s G S}. The cyclohedron Cn is the convex hull in of all the cyclohedron vertices t{v) 
where v ranges over R". For n = 4 and the data vector v = (0.49,5.73,4.01,2.67), we 
have t{v) = (6,1,2,3), while for v' = (0.49,5.73,2.67,4.01) we have t{v') = (6,1,4,1). 
For example, t^v)^ = 2 because V3 = 4.01 is minimal in = {3} and S"^ = {2, 3}. 

Two vectors in R^ share the same signature a = {cxi, (T2, cts} if and only if they are 
mapped to the same cyclohedron vertex r. The convex hull of all cyclohedron vertices 
t{v) is the 3-dimensional cyclohedron C4. This is a simple polytope with 20 vertices, 30 
edges and 12 facets (for the 12 cyclic segments). It is depicted in Figure 3. Vertices in 
the figure, incident to a 'double' edge indicate signatures a with c{a) = 2. Thus the set 
C4 of all signatures has 20 elements, one for each vertex of C4. 

The following theorem summarizes what is known about the cyclohedron. It is ex- 
tracted from Fomin and Reading (2004), Hohlweg and Lange (2005), Markl (1999)| . 
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Theorem 5.1. The cyclohedron Cn is an {n—1) -dimensional polytope. It is the solution 
set in of the following system of one linear equation and n{n — 1) linear inequalities: 

xi+X2-\ \- Xn = n(n-l), (3) 

Xs > y j for each cyclic segment S. (4) 

The cyclohedron Cn is simple, i.e. each vertex lies on precisely n—1 facets. Each inequality 
defines a facet. The total number of vertices equals (^^jf ) ■ More generally, the number 
fi of i-dimensional faces of the cyclohedron is given by the generating function 

n—1 n—1 y 1 \ 2 

E/<-' = Err) ■<-"+i''- <5> 

i=0 k=0 ^ ^ 

Algorithm 12.11 is a greedy method for hnear programming on the cyclohedron Cn- 
Indeed, computing the cyclohedron vertex r(t>) of a data vector v = (f i, . . . , f„) is equiv- 
alent to the linear program of minimizing Yl^=i '^i^i subject to the constraints Q and 
dl]). The optimal vertex of that linear program on Cn is precisely the vector x = t{v). 

Given the linear functional Yl^=i '^i^i be minimized, Algorithm 12.11 generates a 
collection a = {ui, (T2, . . . , of subsets of {1, 2, ... , n}. These sets S = ai are cyclic 

segments, and they indicate which n—1 inequalities (jlj) are tight at the optimal vertex 
X = t{v) of Cn- This implies that t{v) can be recovered from a{v) and vice versa: 

Corollary 5.2. The cyclohedron vertex t{v) of any data vector G M" can be obtained 
from the signature a{v) = {ai, (T2, . . . , cTn-i} by solving the linear system of equations 

(I3D and ^Xs=y^^j for i = 1, 2, . . . , n-1. (6) 

seat ^ ^ 

Conversely, the signature a{v) is recovered from the vertex t{v) by substituting x = t{v) 
into the inequalities and collecting all index sets S for which equality holds. 

In light of Corollary 15.21 we henceforth shall identify signatures cr G C„ with their 
corresponding vertices r of the cyclohedron C„. We note that the solution r to (jS]) can 
be read off easily within Algorithm 12. 1[ It always holds that ts^ := (2), and the other 
n—1 coordinates are obtained by adding one line at the end of the main i loop: 

Output Ts^ = (|Left| + 1) ■ (|Right| + 1). 

Two data vectors v and v' are cyclically equivalent if and only if cr(t') = (t{v') , i.e., 
if and only if the linear functionals corresponding to v and v' are minimized at the same 
vertex t{v) = t{v') of the cyclohedron Cn- The cyclic equivalence classes are the normal 
cones at the vertices of C„. They are specified by the inequalities Vg_ < v^^ for all 
inclusions ak C in cr(f). Since Cn is simple, n—1 inequalities suffice, and these can be 
generated by augmenting Algorithm 12.11 again at the end of the main i loop, as follows: 

if Right 7^ or Left 7^ then output vs^ < vSf,- 
The generated inequalities permit the study of confidence regions for the cyclohedron test. 
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Example 5.3. Fix n = 17 and let v be the data vector for the Obox gene in Example 12 .41 
The augmented Algorithm 12.11 reveals that the cyclic equivalence class of v is given by 



f 11 < f 10 < f 9 < ^"8 < f 6 < f 5 < Vi2 < Vm < f 15 < Vir < V4 < V3 < Vi < V2 

and Vq < Vj and Vn < fie and vi/^ < vi^. 

These inequalities specify the normal cone at the vertex 

t{v) = (2,1,3,4,11,24,1,14,15,16,136,10,1,16,7,1,10) 

of the 16-dimensional cyclohedron C17. Recall that the possible signatures for data with 
n = 17 are (in bijection with) the vertices of C17, and their total number equals 

,^ , /2-17-2\ /32\ 

\Cu\ = ( J = y = 601,080,390. 

Among all these signatures, the vertex t{v) is of interest because the probability that a 
random linear functional attains its minimum over Cn at that vertex is rather small: 

p(v) = c{v)/n\ = 480/17! = 1.35 ■ 10"^^ 

The results of our analysis for the full data set were presented in Section 4. □ 

The theory of graph associahedra also offers the following combinatorial characteriza- 
tion of the possible outputs of Algorithm 12. 1[ A collection {cti, 0^2, . . .} of cyclic segments 
is called a tubing of the n-cycle if any two elements satisfy the following property: ei- 
ther (Ti C (Tj, or (Tj C (Tj, or and are disjoint and no node in cij is adjacent to a 
node in aj Each maximal tubing has the same number of elements, namely n—1, and the 
maximal tubings are precisely the signatures generated by Algorithm 12.11 The simplicial 
complex of all tubings is dual to the face poset of the simple polytope C„. Analogous 
statements hold for the face poset of the graph associahedron of any graph G with vertex 
set {1, 2, . . . , n}. The cyclohedron Cn is the special case when G is the n-cycle. 

We propose that rank tests which are associated with graphs G in this manner be 
called topographical models. This is motivated by their relationship with graphical models 
(Markov random fields) which was developed in Morton et.al. (2007)|. Our cy clohedron 



vertex map r, for G the n-cycle, was denoted t^(^q^ in Morton et.al. (2007), §5]. We 
believe that topographical models for graphs G other than the n-cycle will be useful for 
wide range of statistical problems concerning data with an underlying graphical structure. 



6 Null distribution of the counts and order statistics 

We next compute two probability distribution functions, that of the random variable c and 
of its order statistics. In the first part of this section we introduce a generating function 
that represents the distribution Pc of c under the null distribution. This is applied in the 
second part to derive the order statistics of Pc and a formula for computing the collective 
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p- values (El) exactly. Recall that the set Cn of signatures equals the set of maximal tubings 
or vertices of the cyclohedron Cn- For each a G Cn, the quantity c{a) = p{a) ■ n\ is the 
number of permutations 6 which map to r. See Algorithm 12.11 and Proposition I2.2[ 

We define the count generating function for the cyclohedron test to be the polynomial 

Tnit) := 5^t'=('^). 

By Theorem 15.11 this polynomial gives a refinement of the central binomial coefficient: 

/2n — 2\ 

r„(l) = |Vert(C„)| = ( ^_ J- 
Similarly, the first derivative r^(t) = ^r„(t) gives a refined count of the permutations: 

r;(i) = = nl 

We list the first few non-trivial instances of the count generating function: 

T^{t) = At'^ + m, 

T^it) = 20t^ + lOt^ + 40t, 

r6(t) = 12t® + 24t^ + 48t^ + 48^3 + 24^2 + 96t, 

r7(t) = 28t^° + 56t^^ + 140t^° + 28t® + 56t^ + 112t^ + 112t^ + 112t^ + 56t^ + 224t, 

r8(t) = 8t^° + 32t^^ + 128t^^ + 64t^° + 64t^^ + 64t^° + ■ ■ ■ + 256t^+ 128t^ + 512t, 

Tgit) = 72^2^° + 72t^^^ + 108t^^° + 144t^26 ^ 432^105 ^ . . . + 575^3 + 288^^ + 1152^. 

The count generation function encodes the probability distribution function of c: 

Remark 6.1. The probability Pcij) is the coefficient of in the polynomial {t/n\)-r'^{t) . 

Example 6.2. Consider the case n = 7. The S7 = 10 possible permutation counts are 

im(c) = {1,2,3,4,5,6,8,10,15,20}. 

The probability for each of these counts to be observed is the corresponding coefficient in 

^ ^' 5040 ' 9 6 18 ^ 45 45 

7Gim(c) 

For instance, the cyclohedron Cg has 56 vertices a with c((t) = 15, and this accounts for 
56 • 15 = 840 of the 5040 permutations b in S7. Thus the probability that a random data 
vector G M'' has permutation count c(t>) = 15 is equal to Pc(15) = 840/5040 = 1/6. □ 

We now describe a formula for computing the count generating function. Let 7^ 
denote the set of unlabeled rooted trees with m nodes, where each node has at most two 
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children. The number of these trees is the Wedderhurn-Etherington number^ denoted by 
tm '■= \Tm\- Starting with to = 1, the Wedderburn-Etherington numbers are 

1, 1, 2, 3, 6, 11, 23, 46, 98, 207, 451, 983, 2179, 4850, 10905, 24631, 56011, 127912, . . . 

and they can be computed by the following recursion: 

\m/2\-l 

tm = U- tm-i-1 if m is even, 

i=0 

\m/2\-l 

tm = S U■tm-^-l if m is odd. 

i=0 

This holds because each tree T in 7^ is constructed uniquely by taking an unordered pair 
consisting of a tree Ti in % and a tree T2 in %n-i-i and attaching them to a new root. 
Note that to = 1 corresponds to the case when the new root has outdegree one. We call 
('"r^) the order of the root. The node is called balanced ii i = {m — l)/2 and the two 
subtrees Ti and T2 are isomorphic. In this manner, each node of a tree T G 7^ has an 
order, and it is either balanced or unbalanced. For instance, all leaves are balanced of 
order 1, all nodes with one child are unbalanced of order 1, and nodes with two children 
have order > 2. For a tree T G Tm let unbal(T) denote the number of unbalanced nodes 
in the tree T, and let order(T) denote the product of the orders of all nodes in T. 

Theorem 6.3. The count generating function for the cyclohedron test equals 

r„(t) = n- J2 2"°^^'(^)-t°'^<^^'^(^). 

T6r„„i 

Proof. Every signature a = {(Xi, . . . , cr„_i} in C„ maps to an unordered tree T = T(cr) in 
Tn-i. If = 2 then T is the tree with one node. For n > 3 we construct T iteratively as 
in Algorithm l2.lt by induction, the sets Left and Right correspond to two subtrees Ti and 
T2, and a new root is attached to form the tree corresponding to {6i} U Right U Left. The 
order of the resulting tree T{a) equals the permutation count c{a) computed. It remains 
to be shown that the set of all signatures a which are mapped to the same tree T G T^-i 
has precisely n ■ 2"'^^'^'(^) elements. The factor n comes from the fact that the last element 
6n can be chosen arbitrarily. So, let us suppose 6n = n. Then the indices appearing in 
a are precisely 1,2, .. . ,n— 1. Let Ti and T2 be the two subtrees of the root of T, and 
suppose they have i and n — 2 — i nodes respectively. If i ^ n/2 then either 6n-i = i + 1 
and both {1, 2, ... , i} and {n— 2— i, . . . , n— 2, n— 1} are in a, or 5„_i = n—l—i and both 
{1, 2, . . . , n—2—i} and {n — i, . . . , n—2, n—1} are in a. If i = n/2 then = n/2 and 
both {1, 2, . . . , 72/2 — 1} and {n/2 + 1, . . . , n — 1} are in a. The choices for the remaining 
elements of a are constructed inductively by identifying the nodes of the two subtrees 
with these two sets. If the two subtrees are identical (i.e. the root is balanced) then 
there is only one identification to be considered, otherwise we must consider two cases. 
Proceeding in this manner along the tree, we see that there are 2^^^^^^^^ many choices of 
signatures a on {1,2, ... ,n — 1} which map to T. □ 
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We next present a recursive method for computing the count generating function r„(t). 
Let / = ^ • ttif and g = bjP be any two generating functions and M any positive 
integer. Then we define the * -product of / and g with respect to M as follows: 

f*M9 ■■= 5^a,-6,-f^-^. (7) 
Corollary 6.4. Let Qnit) be the polynomial defined recursively by 

m— 1 

Qo{t) = Qi{t) = t and ^m{t) = ^ijt) *fni-i-^ Qrn-i-i{t). 

Then r„(t) = n ■ Qn-iit) is the count generating function for the cyclohedron test. 

Proof. This follows from the recursive tree construction in the proof of Theorem 16.31 □ 

Example 6.5. Corollary 16.41 easily yields the full expansion of ^ln{t) for small values of 
n. For n = 17, the case of interest in Section 4 (see also Examples 12.41 and 15. 3p . we find 

Tnit) = 272^108108000 ^ 544^89689600 ^ 272^^6486400 ^ 544^80720640 ^ ^ 348160^^ 

+278528t^ + 417792^6 + 278528t^ + 278528t^ + 278528^^ + 139264t2 + 557056t. 

The number of terms in this polynomial equals | im(c)| = 2438. The 2438 values of the 
probability distribution function are plotted on a logarithmic scale in Figure 4. For 
larger values of n, say n > 30, it becomes infeasible to compute the expansion of r„(t), 
but Corollary 16.41 can still be used to design efficient methods for sampling from P^.. □ 

The distribution function -F(r)(7) of the order statistic C(r) is now computed. Defining 
Pi = Pcili) to be the probability under the null hypothesis that the count is equal to ■ji, 
Remark 16.11 tells us that 

{t/n\) ■ r;(t) = pit^i + p2t^^ + ■■■+ psj^'" , where 71 < 72 < ■ ■ ■ < 7s„- 

Consider the identity 

iPi+P2 + ---+Psy = Yl ( • • ^ • ]-PiP2---pt" = 1- 

By definition, F(^r){lk) is the sub-sum of all terms in this sum whose indices satisfy 

\-ik-i < r < N-ik+i is„- 

For the purpose of computational efficiency we rewrite this sub-sum as follows. The 
formula below furnishes us with an efficient method for computing the collective p- values. 
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Lemma 6.6. The probability distribution function under the null distribution Qq of the 
order statistic C(r) is given by 

TV min(r— IjAf— j) 
i=l j=max(0,r— i) 

Proof. The first sum is over the number of data points that have permutation count 7^. 
The second sum is over j, the number of data points whose permutation count is less than 
7fc. Then, the multinomial coefficient gives the possible ways to partition {1, 2, ... , A^} 
into sets of size i, j, and N — i —j; that is, it accounts for possible rearrangements among 
the permutation counts equal to, less than, and greater than 7^. The probability that 
such rearrangement occurs is the product + - ■ ■+pk-iy -pi- {pk+i + - ■ •+Psn)^~*~-'- CH 



. J (pi + • ■ ■ + Pk^iY ■ pI ■ (Pk+i + --- + Ps„) 



Appendix: R code for the cyclohedron test 

The R source code topoGraph.R is available for the cyclohedron test. The software can 
be downloaded from 



http : //bio . math . berkeley . edu/ rankt est s/ index . html 



Our code requires the free statistical software package R R Team (2005)| . Here we de- 



scribe how to perform basic tasks related to the cyclohedron test. The data file must be 
a CSV (comma-separated values) file, where the first column consists of identifying labels 
(such as gene names), and the first row labels the time points (all other rows are the 
corresponding data vectors). We illustrate the use of the basic functions with the data 
file (named '13873. csv') that we described in Section HI The first column consists of the 
ProbeSet IDs. The source code containing the R functions is topoGraph.R. First, we call 
the source code and load the data file from an R command line (here, we assume that 
both files are in the current working directory): 

source ( "topoGraph . R" ) 
dataset<-loaddata( " 13873 . csv" ) 

Next, we calculate the count of each data vector, which is done by the following command: 
counts<-cycleCounts (dataset) 

This defines "counts," a vector which lists the counts c of the data vectors in the order 
given by the data file. To list the genes according to their count ranking, as shown in 
Table 1, we call the function rankby which outputs the labels (here, the ProbeSet IDs) 
of the genes. The following command outputs the ten highest ranked ProbeSets. 

rankby(row.names(dataset) , counts) [1:10] 
[1] "1456017_x_at" "1452041_at" "1418593_at" "1417985_at" 
[5] "1436845_at" "1436343_at" "1426267_at" "1420360_at" 
[9] "1449643_s_at" "1417399_at" 

More extensive documentation is available online. 
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